Bayesian classification and entropy for promoter prediction in human DNA sequences
Abstract
A large amount of genomic data is now available to researchers in public databases. Computational methods already exist for data retrieval and analysis, including sequence similarity searches and structural and functional predictions. The computational detection of genes has received considerable attention, and many accurate methods are available. Other functional sites, however, are more difficult to characterize. In this work, we examine the potential of entropy and Bayesian tools for promoter localization in human DNA sequences. Promoters are regulatory regions (at least one per gene, located near the first exon) that govern the expression of genes; their prediction is known to be difficult and remains an open problem. To process DNA sequences, it is useful to convert them into a numerical representation that preserves their statistical properties. We chose the Chaos Game Representation (CGR) [Jeffrey1990] of DNA sequences, which has two useful properties: the source sequence can be recovered uniquely from its CGR, and the distance between CGR positions measures the similarity between the corresponding sequences. This representation is applied to sequences of "words" of variable length (number of elementary bases); typically, we used words of 1 to 6 nucleotides. Using the CGR, we demonstrated the non-stationarity of the genome: coding, promoter, and genomic regions of DNA yield different CGR matrices. In particular, we observe a fractal depletion in CG for genomic regions (that is, an under-representation of CG words) and CG "islands" in about 80% of promoters. To analyse DNA sequences, reference probabilities for the genomic, coding, and promoter backgrounds are built using data from public databases. We also estimate "local" probability distribution functions using a sliding window and a forgetting factor.
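The CGR construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the corner assignment in CORNERS is one common convention and is an assumption here, and the function names cgr_points, recover_sequence and kmer_frequencies are hypothetical. Each base moves the current point halfway towards its corner of the unit square; because every step halves the distance, the last point encodes the whole sequence, which is the "unique recovery" property mentioned in the abstract.

```python
from collections import Counter

# Assumed corner layout of the unit square (a convention, not taken from the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # design choice: skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point to the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def recover_sequence(point, length):
    """Recover the last `length` bases from a single CGR point.

    Works exactly only for short sequences, since floats have finite precision.
    """
    corner_to_base = {corner: base for base, corner in CORNERS.items()}
    x, y = point
    bases = []
    for _ in range(length):
        # The quadrant containing the point identifies the most recent base.
        cx = 1.0 if x > 0.5 else 0.0
        cy = 1.0 if y > 0.5 else 0.0
        bases.append(corner_to_base[(cx, cy)])
        # Undo the halving step to obtain the previous point.
        x, y = 2.0 * x - cx, 2.0 * y - cy
    return "".join(reversed(bases))


def kmer_frequencies(seq, k):
    """Relative frequencies of overlapping words of length k (A, C, G, T only)."""
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in words if all(b in CORNERS for b in w))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```

Counting the words of length k is equivalent to counting how many CGR points fall in each cell of a 2^k by 2^k grid, so kmer_frequencies plays the role of the CGR matrix at word length k. For example, recover_sequence(cgr_points("GATTACA")[-1], 7) returns the original string.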
We built a naïve Bayesian classifier for promoter detection that tests the promoter/genomic or promoter/coding likelihood ratio of the sequence at hand. Results show that the classifier gives useful results when the window is located near the transcription start site (TSS) and the window length is less than 200 bases. Such a classifier has already proved useful for classifying species, as in [Sandberg2001]. Local probabilities were used to evaluate (i) the local entropy of the sequence and (ii) the Kullback divergence from the background (under a hypothesis on the nature of the background, genomic or promoter). Again, our experiments showed that these indicators clearly reveal the core-promoter and TSS positions in many cases. However, we also noticed, as already pointed out in the literature [Hannenhalli2001, Zhang2003], that the set of promoters can be divided into (at least) two classes: the first (with a high CG ratio) is relatively easy to predict, while the second (which may in fact be divided into further subclasses) gives more mixed results. An interesting point is that a promoter prediction tool can confirm or refute the bioinformatic prediction of a gene. Such examples will be presented at the conference.
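The likelihood-ratio test and the two local indicators can be illustrated as follows. This is a hedged sketch under stated assumptions, not the method as implemented by the authors: the function names, the add-one pseudocount, and the treatment of overlapping words as independent (the "naïve" assumption) are all choices made here for illustration, and real reference probabilities would be estimated from database sequences rather than from short strings.

```python
import math
from collections import Counter
from itertools import product


def kmer_probs(seq, k, pseudocount=1.0):
    """Smoothed word probabilities over all 4**k words (pseudocount is an assumption)."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def log_likelihood_ratio(window, p_promoter, p_background, k):
    """Sum of log(P(word | promoter) / P(word | background)) over the window.

    Overlapping words are treated as independent, which is the naive assumption.
    A positive value favours the promoter hypothesis.
    """
    window = window.upper()
    llr = 0.0
    for i in range(len(window) - k + 1):
        word = window[i:i + k]
        if word in p_promoter:  # ignore words containing symbols other than ACGT
            llr += math.log(p_promoter[word] / p_background[word])
    return llr


def entropy(p):
    """Shannon entropy in bits of a word distribution."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must be positive wherever p is."""
    return sum(v * math.log2(v / q[w]) for w, v in p.items() if v > 0)
```

In a sliding-window setting, p would be the local distribution estimated from the current window and q one of the reference backgrounds; the abstract's forgetting factor is not modelled in this sketch.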
Similar resources
Bayesian classification for promoter prediction in human DNA sequences
Many computational methods are already available for data retrieval and analysis of genomic sequences, but some functional sites are difficult to characterize. In this work, we examine the problem of promoter localization in human DNA sequences. Promoters are regulatory regions that govern the expression of genes; their prediction is known to be difficult and remains an open problem. We pres...
A Validation Test Naive Bayesian Classification Algorithm and Probit Regression as Prediction Models for Managerial Overconfidence in Iran's Capital Market
Corporate directors are influenced by overconfidence, one of the personality traits of individuals, and may take irrational decisions that have a significant long-term impact on the company's performance. The purpose of this paper is to validate and compare the Naive Bayesian Classification algorithm and probit regression in the prediction of management overconfidence at pre...
Comparison of Promoter Sequences of Flowering Control Genes, FT1 and Three Versions of VIN3, in Susceptible and Resistant Sugar Beet Genotypes to Bolting
Autumn sowing of sugar beet is a suitable way in sustainable agriculture. Bolting is an undesirable phenomenon which reduces sugar beet yield and it is the most important limiting factor in autumn sowing of sugar beet. Identification and comparison of the sequence of flowering genes in various genotypes can help to understand the molecular mechanisms controlling bolting. In the previous studies...
In silico screening of G-Quadruplex Structures in Wilms tumor 1 Gene Promoter
Introduction: X-ray diffraction studies have revealed that guanines in a DNA strand may be arranged in quartets and form a structure called a G-quadruplex. Bioinformatics studies have suggested the formation of G-quadruplex structures in crucial human genes, including Wilms tumor 1 (WT1). The aim of this study was the in silico analysis of the guanine-rich sequence in the promoter region of the WT1 gene...
Molecular and Bioinformatics Analysis of Allelic Diversity in IGFBP2 Gene Promoter in Indigenous Makuee and Lori-Bakhtiari Sheep Breeds
The aim of this study was to perform molecular and bioinformatics analysis of the IGFBP2 gene promoter in association with some economic traits in the indigenous Makuee (MS) and Lori-Bakhtiari (LB) breeds. DNA was extracted from blood samples of 120 MS and 200 LB animals, and a 297 bp fragment from the upstream sequence of the studied gene was amplified and genotyped by single-strand conformational polymo...